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Abstract 



C/3 

q 

C/5 , We present a randomized parallel algorithm that computes the greatest common divisor of 

two integers of n bits in length with probability 1 — o(l) that takes O(nloglogn/logn) time 
using 0(n 6+e ) processors for any e > on the EREW PRAM parallel model of computation. 
(Nl ' The algorithm either gives a correct answer or reports failure. 

We believe this to be the first randomized sublinear time algorithm on the EREW PRAM 
for this problem. 
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1 Introduction 

The parallel complexity of computing integer greatest common divisors is an open problem (see 
[5]), and no new complexity results have been published since the early 1990s. This problem is not 
known to be either P-complete or in MC [TOj, [131 ES] ■ 

The first sublinear time parallel algorithm that uses a polynomial number of processors is due to 
Kannan, Miller, and Rudolph [12] . Adleman and Kompella pQ presented a randomized algorithm 
that runs in polylog time, but uses a superpolynomial, yet subexponential number of processors. 
The fastest currently known algorithm is due to Chor and Goldreich |6j which takes Oinj log n) time 
using 0(n 1+e ) processors. See also [22], and Sedjelmaci pjj] who showed a clear way to do extended 
GCDs in the same complexity bounds. However, all of these algorithms use the concurrent-read 
concurrent-write (CRCW) parallel RAM (PRAM) model of computation. 

The algorithms of Chor and Goldriech [6j and the author [22J can be readily modified for the 
weaker concurrent-read exclusive- write (CREW) PRAM to obtain running times of 0(n log log nj log n) 
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using a polynomial number of processors. And of course one can take a CRCW PRAM algorithm 
and emulate it on an exclusive-read exclusive-write (EREW) PRAM at a cost of a factor of O(logra) 
in the running time, giving linear time algorithms for the EREW PRAM using a polynomial number 
of processors. 

In this paper, we present what we believe is the first sublinear time, polynomial processor 
EREW PRAM algorithm for computing greatest common divisors. Note that the EREW PRAM 
is weaker than the CREW or CRCW PRAM models of parallel computation. We do make use of 
random numbers in a fundamental way. 

Theorem 1.1 There exists a randomized algorithm to compute the greatest common divisor of two 
integers of total length n in binary with probability 1 — o(l) in 0(n log log nj log n) time using a 
polynomial number of processors on the EREW PRAM. 

In the next section we describe our algorithm, and in Section [3] we prove correctness, give a 
complexity analysis, and flesh out the details of the algorithm. We conclude in Section U] with a 
simple result on the relative density of integers with large polynomially smooth divisors, which is 
needed for the analysis of the algorithm. 

2 Algorithm Description 

Define the inputs as u,v of total length n in binary. Let B, our small prime bound, be defined 
as B = Bin) := n 2 . A larger value for B can be chosen, so long as log-B = o(n), but correctness 
would be compromised if B were significantly smaller (see Section 13. ip . 

1. Find a list of primes up to B. Also, for each prime p < B, compute and save p e for e = 
1 . . . \ n/ log 2 p\ . 

2. Remove and save common prime factors of u,v that are < B, and let uq,vq denote these 
modified inputs. WLOG we assume uo > v$. 

3. Main Loop. Repeat while UiVi ^ 0. Here i indicates the current loop iteration, starting at 
i = 0. 

(a) For j := 1 to 2B log n in parallel do: 

i. Choose a^- uniformly at random from 1 ... V{ — 1. 

ii. Compute := aijUi mod Uj. 
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iii. Compute Sy as rjj with all prime factors p < B removed. 
(We elaborate on how to do this below.) 

(b) Find s, := mm,- {sij}. Let j m ; n denote the value of j for which Sj = Sjj, and for later 
reference, let a,- = a,-,- • . 

3 6 6 j mi n 



4. Uj + fj is, with probability 1 — o(l), equal to gcd(uo>^o) ( a s we show below). If we err, it is 
by including spurious factors that do not belong, so verify that Ui + v% evenly divides both 
«o,fo> an d if not, report an error. Otherwise, include any saved common prime factors found 
in step 2 above, and the algorithm is complete. 

3 Algorithm Analysis 

In this section we prove correctness, and compute the parallel complexity of our algorithm from 
the previous section. 

3.1 Correctness 

Note that in Steps 2 and 4 we handle any prime divisors < B of the gcd(u, v), so WLOG we can 
assume either gcd(w, v) = 1 or gcd(u, v) > B. 

At iteration % of the main loop, we perform the transformation 



Since Sj is equal to aiui mod v%, ignoring factors below B, this transformation will only fail to 
preserve the greatest common divisor if Oj and Vi share a common factor. Furthermore, this 
common factor must be composed only of primes exceeding B. Since a% is chosen uniformly at 
random, the probability and Vi share a prime factor larger that B is at most 



As we will see below, with high probability, the number of main loop iterations is o(n). Thus, the 
probability that any of the at values introduces a spurious factor is o(l). 



Note that in |23j . a similar, but not identical, transformation was analyzed. It was observed 

that in practice, with no removal of small prime divisors, the expected number of bits contributed 
by spurious factors was constant per main loop iteration. 




(Ui,Vi) -> (Vi,Si). 
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3.2 Runtime Analysis 

First we calculate the number of main loop iterations, and then we describe how each iteration can 
be computing in O(logn) time using a polynomial number of processors. 

3.2.1 Main Loop Iterations 

Let W := 0.5(log J B) 2 /loglog J B. Then by Theorem g^l which we prove in the next section, the 
length of Sij is smaller than < V{ by at least @(W) bits with probability at least 1/B. (Note 
that we chose 0.5 to get a clean 1/B probability - other choices for the constant can be made to 
work with the right adjustments.) 

So, the probability all 2B log n choices for j fail to have log Sjj < log V{ — W is 



So, with probability 1 — 0(l/n 2 ), logSj < logt;j — W. 

We remove roughly (log B) 2 / log log B bits each main loop iteration. Thus, the number of main 
loop iterations is 0{n log log B/(logB) 2 ) = o(n). The probability that any one loop iteration fails 
to remove the needed Q(W) bits is 0(l/n), so the probability we exceed this number of main loop 
iterations and terminate without computing an answer is o(l). 

3.2.2 Computation Cost and Algorithm Details 

Unless stated otherwise, cost is given for the EREW PRAM. For a brief overview of the cost of 
parallel arithmetic, see |22l Section 6.2]. 

Step 1. We can find the primes < B in O(log-B) time using O(B) processors (see |24j). For each 
prime p < B and e < n, we can compute p e in at most 0(log n) multiplications, each of which 
takes O(log-B) time using processors [17]. See also [16} Theorem 12.2]. As there are 

0{B j log B) primes, this is O(lognlog-B) time using nB 2+ °^ processors. 

Step 2. For each prime p and exponent e, we assign a group of processors to see if p e divides u 
but p e+1 does not. Division takes O(logn) time using n 1+e processors for any e > using the 
algorithm of Beame, Cook, and Hoover [3], giving a total processor count of 0(n 2+e B/ log B). 

The result is a vector of the form [p e ^] that lists the primes dividing u with maximal exponents. 
Since there are at most n/\ogB integers in the vector > 1, and they total at most n bits 
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(their product is < u), the iterated product algorithm of [3] can take their product in O(logre) 
time using n 1+e processors. Dividing u by this product can be done at the same cost. 

We repeat this for v, and obtain a similar vector. 

We combine these two vectors using a minimum operation, and take the product of the entries, 
to obtain the shared prime power divisors of u, v which must be saved for Step 4. 

The total cost of this step is O(logn) time using 0(n 2+<L B/ log B) processors. 

See also [5] and references therein. 

Step 3. Checking for zero takes O(logn) time using 0(n) processors. 

Step 3.(a)i For each j, choosing an n-bit number at random takes constant time using 0(n) 
processors. We reduce it modulo V{ in O(logre) time using n 1+e processors [3]. 

Step 3.(a)ii This is simply a multiplication and a division, again taking O(logn) time using n 1+e 
processors. 

Step 3.(a)iii Here we use the same method as described in Step 2 above. This is O(logn) time 
using 0(n 2+e B/ log B) processors for each j. 

Step 3. (a) And so, the total cost of this parallel step is 0(log B) time using 0(n 2+e (log n)B 2 j log B) 
processors. 

Step 3.(b) This can be done in 0(log(i?logn)) = O(log-B) time using 0(B log n) processors. 

Step 3.(c) This takes constant time using O(n) processors. 

We conclude that the cost of one main loop iteration is 0(log B) time using 0(n 2+e (log n)B 2 / log B) 
processors. Step 3.(a)iii is the bottleneck. 

Earlier we showed that the number of iterations is 0(nloglogi?/(logi?) 2 ), for a total time of 
0(n log log Bj log B) for all iterations of the the main loop. 

Step 4. This is an addition, a division, and a multiplication using the results from Step 2; O(logn) 
time using n 1+e processors. 

Clearly, the bottleneck of the algorithm is Step 3. (a). The overall complexity is 
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0(n' 



) processors, where e > 0, 



for the EREW PRAM. This completes our proof of Theorem II. li 

One could take B to be superpolynomial in n; for example, if B = exp[- v /ra] we can obtain 
a running time of roughly yjn using exp[0(-y/re)] processors. Similar results could be obtained 
from some of the CRCW PRAM algorithms mentioned in the introduction by porting them to the 
EREW PRAM and setting parameters appropriately. 

We can also obtain an 0(n/ log n) running time on the randomized CRCW PRAM; see [4] 
for how to perform the necessary main loop operations in O (log n/ log log n) time via the explicit 
Chinese remainder theorem. See also [8]. 

It would be interesting to see if this algorithm can be modified to compute Jacobi symbols 
quickly in parallel. See [9] and references therein. 

4 Numbers with Smooth Divisors 

Let P{n) denote the largest prime divisor of n. If P(n) < y we say that n is y-smooth. Let 



the number of integers < x that are y-smooth. Let u = u(x,y) := log xj logy. We will make use of 
the following lemma. 

Lemma 4.1 ([11, Corollary 1.3]) Let e > and assume u < y l ~ e . Then 



for x > y > 2. 

Note that the o(l) here tends to zero for large u, and the implied constant depends on e. Better 
results are known, but this suffices for our purposes. For additional references see [25] , and for 
references on approximation algorithms for ^(x,y) see |14j . 
We recall the definition of H}~, the kth. harmonic number as 



It is well known that Hk = log A; + 7 + 0(l/k), where 7 = 0.57721 ... is Euler's constant (for 
example, see [I5j 4.5.4]). 



tf(x,y) = #{n < x : P{n) < y}, 



^(x, y) = xu 



«(l+o(l)) 
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Fix a constant c > 0. Define B(x) to be a strictly increasing function of x, but with log-B(x) = 
o(loga;). (We are primarily interested in B(x) polynomial in logic.) Define 

log log B(x) 

F(x) := < x : n = my, P(m) < B(x), log to > VK(x)}. 

In other words, F{x) counts integers n < x where n has a i?(rr)-smooth divisor that is > exp W(x) 
Theorem 4.2 Let e > 0. For sufficiently large x we have 

F(x) > 



B{x)< l +z) ' 



Proof: Choose 5 > such that (1 + S) 3 < 1 + e. From the definition, we have 

x/exp[W(x)] . 

Fix) = V *(-,B(x: 



y=l 



y 



First, we limit the range of summation to obtain the lower bound 

x/ exp[iy (x)] 

E 

y=x/( eX p[(l+S)W(x)]) 



x/exp[W(x)] . . 

F[x) > Yl *(J' B( 7 



Next, we apply Lemma 14.11 We also observe that ■u~ n ( 1 +°( 1 )) > ■u~( 1 +< 5 )" for large u, and for a 
lower bound, we can fix u at its largest value on the interval of summation, namely u = u(x) = 
(1 + 5)W{x)/ log B{x). This gives us 

x I exp[W \x)\ 

Fix) > Yl --u-^ +5 >. 

y=x/(exp[(l+S)W(x)]) V 

Using 1/t = Ffb — H a > (1 — S) log(6/a) for sufficiently large a, we obtain that 

Fix) > x ■ 5(1 - S)W{x) ■ u -( 1+ ^ u 
> x-u- {1+S)u 

as W is a strictly increasing function of x for large x. Next we plug in for u as follows: 
log(u-^ 1+ ^ u ) = -(l + 5)ulogu 
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= -(1+*) 



(l + 5)W(x) 
\ogB(x) 



log 



(l + 5)W(x) 
logB(x) 



) 



_ (l + ^clogffQr) 

1 J loglog£(x) 

> -c(l + 5) 3 logB(x) 



log 



(1 + j)c log_B(x) 
log log B(x) 



) 



for x sufficiently large. We now have 



F(x) > x-B(x)-< 1+s ^ 



for sufficiently large x. □ 

Only a lower bound is needed for our purposes, but one can obtain an upper bound on F(x) of 
similar shape using the same general methods. 
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